# Bidirectional Acculturation Measure (BAM): Replication Package

**Author:** Jessala A. Grijalva, PhD
Postdoctoral Fellow, Institute for Latino Studies, University of Notre Dame

## Overview

This repository contains the complete analysis pipeline and replication materials for two manuscripts:

1. **"The Myth of the Zero-Sum: A Bidirectional Acculturation Measure Using Gaussian Mixture Models"** (BAM) — submitted to *Politics, Groups, and Identities*
2. **"Making Cluster Analysis Inferential: A Bootstrap Validation Framework for Gaussian Mixture Models"** (ICA)

Both papers draw on the same analytic pipeline applied to the 2006 Latino National Survey (LNS). The BAM paper introduces a four-orientation acculturation typology derived from Gaussian Mixture Model (GMM) clustering. The ICA paper formalizes the bootstrap validation framework used to evaluate that solution.

## Data Source

**Latino National Survey, 2006**
ICPSR Study 20862
https://www.icpsr.umich.edu/web/ICPSR/studies/20862

The raw LNS data cannot be redistributed under the ICPSR terms of use. To replicate from scratch, download the file `20862-0003-Data.rda` from ICPSR and place it in `data/raw/`. All downstream analyses can still be reproduced without it, using the processed data files included in this repository (see File Structure below).

The survey codebook and questionnaire are available directly from ICPSR at the study URL above.

## File Structure

```
acculturation-bam-scale/
├── Phase_1_Preprocessing.qmd      # Data cleaning, imputation (MICE-PMM), variable construction
├── Phase_2_Analysis.qmd            # GMM clustering, bootstrap stability, bootstrap CIs, cross-validation
├── Manuscript_Figures.qmd          # Publication-ready figures and tables for both papers
├── Appendix.qmd                    # Shared online appendix (Sections A–F)
├── data/
│   ├── raw/                        # Raw LNS data (not included; see Data Source above)
│   └── processed/
│       ├── clean_data.rda          # Post-imputation dataset (Phase 1 output)
│       ├── lns_with_clusters_k4.rda        # Legacy Phase 1 cluster assignments
│       ├── bam_clustered_gmm_k4.rda        # Final clustered dataset with orientation labels (Phase 2 output)
│       └── phase2_validation_results.rda   # All validation metrics, bootstrap results, CIs (Phase 2 output)
├── figures/                        # High-resolution PNGs (300 and 600 DPI) for journal production
├── bam-scale.Rproj                 # RStudio project file
├── .gitignore
├── LICENSE
└── README.md
```

## Reproduction Instructions

### Quick Start (from processed data)

The processed `.rda` files are included, so you can reproduce all tables, figures, and appendix materials without the raw LNS data:

```r
# 1. Open bam-scale.Rproj in RStudio
# 2. Render the manuscript outputs:
quarto::quarto_render("Manuscript_Figures.qmd")
quarto::quarto_render("Appendix.qmd")
```
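Before rendering, it can help to confirm that the processed files are in place and to see what each one contains. The sketch below is not part of the repository. It assumes the working directory is the project root, and the helper `inspect_rda()` is a hypothetical name introduced here. The README does not document the object names stored inside each `.rda` file, so the helper lists them rather than assuming them.

```r
# Processed files listed under data/processed/ in the File Structure section.
processed_files <- file.path(
  "data", "processed",
  c("clean_data.rda", "bam_clustered_gmm_k4.rda", "phase2_validation_results.rda")
)

# Report which files are present (all FALSE if run outside the project root).
present <- file.exists(processed_files)
names(present) <- basename(processed_files)
print(present)

# load() restores objects under the names they were saved with. Loading into a
# fresh environment lists those names without touching the global workspace.
# inspect_rda() is a hypothetical helper, not a function from the repository.
inspect_rda <- function(path) {
  env <- new.env()
  load(path, envir = env)
  ls(env)
}

for (f in processed_files[present]) {
  cat(basename(f), ":", paste(inspect_rda(f), collapse = ", "), "\n")
}
```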

### Full Pipeline (from raw data)

If you have the raw LNS file from ICPSR:

```r
# 1. Place 20862-0003-Data.rda in data/raw/
# 2. In Phase_1_Preprocessing.qmd, set the `execute: eval` option to true (it is false by default)
# 3. Render in order:
quarto::quarto_render("Phase_1_Preprocessing.qmd")
quarto::quarto_render("Phase_2_Analysis.qmd")
quarto::quarto_render("Manuscript_Figures.qmd")
quarto::quarto_render("Appendix.qmd")
```

**Important:** Phase 1 uses MICE-PMM imputation (multiple imputation by chained equations with predictive mean matching), which is stochastic. Even with the same seed, re-running Phase 1 can produce slightly different imputed values, and therefore different cluster assignments, because the results depend on the R version, package versions, and platform. The canonical imputed dataset is `clean_data.rda`. Phase 1 is set to `eval: false` by default to preserve this artifact.
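As a rough illustration of the imputation step, the sketch below runs predictive mean matching with `mice` on `nhanes`, a small example dataset shipped with the package, rather than on the LNS data. The number of imputations (`m = 1`) and the seed value are assumptions made for the example and are not taken from Phase 1. Recording the R and package versions alongside the output makes it easier to explain any differences from `clean_data.rda`.

```r
library(mice)

# nhanes (25 rows, with missing values) stands in for the LNS items here.
toy <- mice::nhanes

# Predictive mean matching for every incomplete column.
# m = 1 and seed = 2500 are illustrative assumptions, not the Phase 1 settings.
imp_a <- mice(toy, method = "pmm", m = 1, seed = 2500, printFlag = FALSE)
imp_b <- mice(toy, method = "pmm", m = 1, seed = 2500, printFlag = FALSE)

completed_a <- mice::complete(imp_a, 1)
completed_b <- mice::complete(imp_b, 1)

# Within one session and one set of package versions the seed fixes the draws.
same_within_session <- identical(completed_a, completed_b)
print(same_within_session)

# Record the environment, since results can differ across versions and platforms.
env_record <- list(
  r_version    = R.version.string,
  mice_version = as.character(packageVersion("mice")),
  platform     = R.version$platform
)
str(env_record)
```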

## Analysis Summary

### Clustering

- **Algorithm:** Gaussian Mixture Model (`Mclust()` from the mclust package, EEV covariance structure: equal volume, equal shape, variable orientation)
- **Clusters:** G = 4
- **Variables:** AMERICAN, CULTURAL_IDENTITY, KEEPSPAN, DISTINCT, LEARNENG (standardized)
- **Seed:** 2500
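
The settings above correspond to an `Mclust()` call along the following lines. This is a minimal sketch on simulated data, not the Phase 2 code: the group centers and sample size are invented for illustration, and only the column names, `G = 4`, the EEV model, the standardization step, and the seed come from the list above.

```r
library(mclust)

set.seed(2500)

# Simulated stand-in for the five LNS items: four groups of 100 rows each.
# The centers are arbitrary illustrative values, not estimates from the LNS.
centers <- matrix(
  c( 2,  2,  2,  2,  2,
    -2, -2,  2,  2, -2,
     2, -2, -2,  2,  2,
    -2,  2, -2, -2, -2),
  nrow = 4, byrow = TRUE
)
grp <- rep(1:4, each = 100)
sim <- centers[grp, ] + matrix(rnorm(400 * 5), ncol = 5)
colnames(sim) <- c("AMERICAN", "CULTURAL_IDENTITY", "KEEPSPAN", "DISTINCT", "LEARNENG")

# Standardize each variable, then fit a four-component EEV mixture.
X <- scale(sim)
fit <- Mclust(X, G = 4, modelNames = "EEV", verbose = FALSE)

# Cluster sizes and cluster means on the standardized scale.
print(table(fit$classification))
print(round(fit$parameters$mean, 2))
```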

### Validation

- **Bootstrap stability:** B = 1,000 refits on 80% subsamples drawn without replacement, scored with the Adjusted Rand Index (ARI)
- **Bootstrap CIs on cluster means:** B = 400, 95% percentile intervals
- **Bootstrap CIs on silhouette and Dunn index:** B = 400
- **Cross-validation:** 5-fold, using `predict()` on held-out folds
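
The stability and cross-validation checks can be sketched as follows. This is an illustration on simulated data with a small `B`, not the Phase 2 implementation. In particular, scoring each refit by its ARI against the full-sample labels for the same rows is an assumption about how agreement is measured.

```r
library(mclust)

set.seed(2500)

# Simulated stand-in data (arbitrary centers, four groups of 100 rows).
centers <- matrix(
  c(2, 2, 2, 2, 2,  -2, -2, 2, 2, -2,  2, -2, -2, 2, 2,  -2, 2, -2, -2, -2),
  nrow = 4, byrow = TRUE
)
X <- scale(centers[rep(1:4, each = 100), ] + matrix(rnorm(400 * 5), ncol = 5))

fit_full <- Mclust(X, G = 4, modelNames = "EEV", verbose = FALSE)

# Subsample stability: refit on 80% of rows drawn without replacement and
# compare labels with the full-sample solution on those same rows.
B <- 20  # kept small for illustration; the README reports B = 1,000
ari <- rep(NA_real_, B)
for (b in seq_len(B)) {
  idx <- sample(nrow(X), size = floor(0.8 * nrow(X)), replace = FALSE)
  fit_b <- Mclust(X[idx, , drop = FALSE], G = 4, modelNames = "EEV", verbose = FALSE)
  if (is.null(fit_b)) next  # Mclust() returns NULL if the model cannot be fit
  ari[b] <- adjustedRandIndex(fit_full$classification[idx], fit_b$classification)
}
print(summary(ari))

# 5-fold cross-validation: fit on four folds, classify the held-out fold
# with predict(), and compare against the full-sample labels.
folds <- sample(rep(1:5, length.out = nrow(X)))
cv_ari <- sapply(1:5, function(k) {
  held_out <- folds == k
  fit_k <- Mclust(X[!held_out, , drop = FALSE], G = 4, modelNames = "EEV", verbose = FALSE)
  if (is.null(fit_k)) return(NA_real_)
  pred <- predict(fit_k, newdata = X[held_out, , drop = FALSE])$classification
  adjustedRandIndex(fit_full$classification[held_out], pred)
})
print(round(cv_ari, 3))
```

ARI is invariant to how cluster labels are numbered, so the refits do not need their labels matched to the full-sample solution before comparison.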

### Four Acculturation Orientations

| Orientation | n | Heritage | American | Description |
|---|---|---|---|---|
| Culture Affirming | 433 | High | Low | Strong heritage identity, lower American identification |
| Assimilationist | 705 | Low | High | Strong American identity, lower cultural maintenance |
| Demicultural | 378 | Low | Low | Moderate on both dimensions |
| Bicultural | 3,269 | High | High | High on both heritage and American identity |
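
For reference, the group sizes in the table can be converted to sample shares with a few lines of R. The counts are copied from the table; nothing else is assumed.

```r
# Group sizes copied from the table above.
n <- c(
  "Culture Affirming" = 433,
  "Assimilationist"   = 705,
  "Demicultural"      = 378,
  "Bicultural"        = 3269
)

total <- sum(n)
shares <- round(100 * n / total, 1)  # percentage of the analytic sample

print(total)
print(shares)
```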

## Software Requirements

- R (>= 4.1.0)
- Quarto (>= 1.3)
- Key R packages: mclust, dplyr, tidyr, ggplot2, knitr, kableExtra, mice, e1071, dbscan, cluster, clValid

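All package dependencies are loaded via `pacman::p_load()` at the top of each `.qmd` file. `p_load()` installs any missing packages before loading them, but the pacman package itself must already be installed.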
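
A one-time setup along these lines should cover the requirements above. It is a sketch rather than a script from the repository, and the CRAN mirror URL is an assumption that can be replaced with any mirror.

```r
# Check the minimum R version listed above.
stopifnot(getRversion() >= "4.1.0")

# Install pacman if it is missing (the mirror choice is an assumption).
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman", repos = "https://cloud.r-project.org")
}

# p_load() installs any missing packages, then attaches all of them.
pacman::p_load(
  mclust, dplyr, tidyr, ggplot2, knitr, kableExtra,
  mice, e1071, dbscan, cluster, clValid
)
```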

## License

Code is licensed under the MIT License; data and documentation are licensed under CC BY 4.0. See `LICENSE` for details.

## Contact

Jessala A. Grijalva, PhD
Postdoctoral Fellow, Institute for Latino Studies
University of Notre Dame
jgrijal2@nd.edu
